Add some unit tests for sled-agent Instance creation #4489

lifning · 2023-11-11T08:47:47Z

Depends on #4325 for faking zone creation.

At time of writing, instance creation roughly looks like:

nexus -> sled-agent: instance_put_state
- sled-agent: InstanceManager::ensure_state
  - sled-agent: Instance::propolis_ensure
    - sled-agent -> nexus: cpapi_instances_put (if not migrating)
    - sled-agent: Instance::setup_propolis_locked (blocking!)
      - RunningZone::install and Zones::boot
      - illumos_utils::svc::wait_for_service
      - self::wait_for_http_server for propolis-server itself
    - sled-agent: Instance::ensure_propolis_and_tasks
      - sled-agent: spawn Instance::monitor_state_task
    - sled-agent -> nexus: cpapi_instances_put (if not migrating)
  - sled-agent: return ok result
nexus: handle_instance_put_result

Or at least, it does in the happy path. #3927 saw propolis zone
creation take longer than the one minute that nexus's call to sled-agent's
instance_put_state waits before timing out. That might've looked something like:

nexus -> sled-agent: instance_put_state
- sled-agent: InstanceManager::ensure_state
  - sled-agent: Instance::propolis_ensure
    - sled-agent -> nexus: cpapi_instances_put (if not migrating)
    - sled-agent: Instance::setup_propolis_locked (blocking!)
      - RunningZone::install and Zones::boot
nexus: i've been waiting a whole minute for this. connection timeout!
nexus: handle_instance_put_result
sled-agent: [...] return... oh, they hung up. :(
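
The failure mode above can be reproduced in miniature with two threads and a channel. This is only a sketch using the standard library, not sled-agent or nexus code: `blocking_call`, `CallerView` and `CallResult` are made-up names, the handler thread stands in for sled-agent doing slow zone setup, and the `recv_timeout` stands in for nexus's client-side timeout.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// What the caller (standing in for nexus) ends up seeing.
#[derive(Debug, PartialEq)]
enum CallerView {
    GotResponse,
    TimedOut,
}

/// Outcome of one simulated request, from both sides.
#[derive(Debug)]
struct CallResult {
    caller: CallerView,
    /// Whether the handler's response was accepted by the channel.
    response_delivered: bool,
}

/// Hypothetical stand-in for a blocking request: the handler needs `work`
/// to finish, and the caller waits at most `timeout` for the response.
fn blocking_call(work: Duration, timeout: Duration) -> CallResult {
    let (resp_tx, resp_rx) = mpsc::channel::<()>();

    // Stand-in for sled-agent: do the slow setup, then try to respond.
    let handler = thread::spawn(move || {
        thread::sleep(work);
        // `send` returns Err once the receiver has been dropped,
        // i.e. once the caller has hung up.
        resp_tx.send(()).is_ok()
    });

    // Stand-in for nexus: give up after `timeout`.
    let caller = match resp_rx.recv_timeout(timeout) {
        Ok(()) => CallerView::GotResponse,
        Err(_) => CallerView::TimedOut,
    };
    // Hang up. From here on the handler has nobody to answer to.
    drop(resp_rx);

    let response_delivered = handler.join().expect("handler thread panicked");
    CallResult { caller, response_delivered }
}
```

Calling `blocking_call` with `work` well above `timeout` gives the "oh, they hung up" case: the caller sees `TimedOut`, the handler still runs to completion, and its response goes nowhere.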

To avoid this timeout being implicit at the Dropshot configuration
layer (that is to say, we should still have some timeout),
we could consider a small refactor to make instance_put_state not a
blocking call -- especially since it's already sending nexus updates on
its progress via out-of-band cpapi_instances_put calls! That might look
something like:

nexus -> sled-agent: instance_put_state
- sled-agent: InstanceManager::ensure_state
  - sled-agent: spawn {
    - sled-agent: Instance::propolis_ensure
      - sled-agent -> nexus: cpapi_instances_put (if not migrating)
      - sled-agent: Instance::setup_propolis_locked (blocking!)
      - sled-agent: Instance::ensure_propolis_and_tasks
        
        sled-agent: spawn Instance::monitor_state_task
      - sled-agent -> nexus: cpapi_instances_put (if not migrating)
      - sled-agent -> nexus: a cpapi call equivalent to the handle_instance_put_result nexus currently invokes after getting the response from the blocking call

(With a way for nexus to cancel an instance creation by ID, and a timeout
in sled-agent itself for terminating the attempt and reporting the failure
back to nexus, and a shorter threshold for logging the event of an instance
creation taking a long time.)
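
A rough sketch of that shape, again with only the standard library and entirely hypothetical names (`Manager`, `Report`, the `steps`/`step`/`deadline` parameters): the request handler spawns the work and returns right away, the worker enforces its own deadline, the caller can cancel by ID, and the final result travels back over an out-of-band channel that stands in for the extra cpapi call. The real change would presumably use async tasks rather than threads; this is only meant to show the control flow.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::Sender;
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

type InstanceId = u32;

/// Out-of-band result, standing in for a cpapi call back to nexus.
#[derive(Debug, PartialEq)]
enum Report {
    Ready(InstanceId),
    Failed(InstanceId, &'static str),
}

/// Hypothetical stand-in for the sled-agent side.
struct Manager {
    reports: Sender<Report>,
    cancel_flags: HashMap<InstanceId, Arc<AtomicBool>>,
}

impl Manager {
    fn new(reports: Sender<Report>) -> Self {
        Manager { reports, cancel_flags: HashMap::new() }
    }

    /// Returns immediately; the setup (simulated as `steps` sleeps of
    /// `step` each) runs on a spawned thread with its own `deadline`.
    fn ensure_state(&mut self, id: InstanceId, steps: u32, step: Duration, deadline: Duration) {
        let cancelled = Arc::new(AtomicBool::new(false));
        self.cancel_flags.insert(id, Arc::clone(&cancelled));
        let reports = self.reports.clone();

        thread::spawn(move || {
            let start = Instant::now();
            for _ in 0..steps {
                // Check for cancellation and the deadline between steps.
                if cancelled.load(Ordering::SeqCst) {
                    let _ = reports.send(Report::Failed(id, "cancelled"));
                    return;
                }
                if start.elapsed() >= deadline {
                    let _ = reports.send(Report::Failed(id, "timed out"));
                    return;
                }
                thread::sleep(step);
            }
            let _ = reports.send(Report::Ready(id));
        });
    }

    /// Ask an in-flight creation to stop; returns false for an unknown ID.
    fn cancel(&self, id: InstanceId) -> bool {
        match self.cancel_flags.get(&id) {
            Some(flag) => {
                flag.store(true, Ordering::SeqCst);
                true
            }
            None => false,
        }
    }
}
```

The point of the design is that the timeout becomes an explicit, reportable event on the sled-agent side instead of a dropped connection: every path through the worker ends in exactly one `Report`.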

Before such a change, though, we should really have some more tests around
sled-agent's instance creation code at all! So here's a few.
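
For a flavor of what testing against faked zone creation can look like, here is a minimal sketch. None of these names come from omicron or from #4325: `ZoneBooter`, `FakeZoneBooter`, `create_instance` and the `zone-{id}` naming are all invented for illustration, and the real tests in this PR are structured differently.

```rust
use std::cell::RefCell;

/// Hypothetical seam between instance creation and the real zone tooling.
trait ZoneBooter {
    fn install_and_boot(&self, zone_name: &str) -> Result<(), String>;
}

/// Fake that records which zones were "booted" and can be told to fail.
struct FakeZoneBooter {
    fail: bool,
    booted: RefCell<Vec<String>>,
}

impl ZoneBooter for FakeZoneBooter {
    fn install_and_boot(&self, zone_name: &str) -> Result<(), String> {
        if self.fail {
            return Err(format!("could not boot {zone_name}"));
        }
        self.booted.borrow_mut().push(zone_name.to_string());
        Ok(())
    }
}

#[derive(Debug, PartialEq)]
enum InstanceState {
    Running,
    Failed,
}

/// Hypothetical creation path: boot the zone, then report the state.
fn create_instance(booter: &dyn ZoneBooter, id: u32) -> InstanceState {
    let zone_name = format!("zone-{id}"); // made-up naming scheme
    match booter.install_and_boot(&zone_name) {
        Ok(()) => InstanceState::Running,
        Err(_) => InstanceState::Failed,
    }
}
```

With the fake in place, a test can check both the happy path and the failure path without touching real zones.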

illumos-utils/src/running_zone.rs

sled-agent/src/instance.rs

lifning · 2024-03-05T07:38:55Z

closing this because it makes more sense to just pull it in as part of #4691

lifning force-pushed the sled-agent-instance-creation-tests branch 3 times, most recently from dfdfe77 to 0b96df7 Compare November 14, 2023 03:52

jordanhendricks self-requested a review November 14, 2023 20:44

lifning force-pushed the sled-agent-instance-creation-tests branch from 0b96df7 to 8c23c45 Compare November 15, 2023 05:11

jordanhendricks reviewed Nov 28, 2023

lifning force-pushed the sled-agent-instance-creation-tests branch 3 times, most recently from 74d4b87 to bdeb287 Compare December 1, 2023 23:54

lifning mentioned this pull request Dec 14, 2023

sled-agent: don't block during instance creation request from nexus #4691

Merged

lifning force-pushed the sled-agent-instance-creation-tests branch 2 times, most recently from e6e7db7 to 2c22a2e Compare December 21, 2023 09:57

lifning force-pushed the sled-agent-instance-creation-tests branch from 2c22a2e to efb04a8 Compare December 21, 2023 20:36

morlandi7 linked an issue Jan 26, 2024 that may be closed by this pull request

Propolis zone installation took 81 seconds and caused instance start to time out #3927

Closed

lifning closed this Mar 5, 2024
